Text mining in an example

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the department of Customer Relationship Management.

  • He reads customers’ reviews (comments), extracts the aspects they write about, and identifies their sentiments.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because the company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling is labor-intensive!

Challenges?

  • What are the challenges Garry, Larry, and Harry encounter in doing their job, when working with text data?

Challenges with text data

  • Huge amount of data

  • High dimensional but sparse

    • all possible word and phrase types in the language!

Challenges with text data

  • Ambiguity

Challenges with text data

  • Noisy data

    • Examples: Abbreviations, spelling errors, short text
  • Complex relationships between words

    • “Hema merges with Intertoys”

    • “Intertoys is bought by Hema”

Back to the story

Example

  • During one of the coffee moments at the company, Garry was talking about the situation at the department of Customer Relationship Management.

  • When Carrie, his colleague from the IT department, hears about the situation, she suggests that Garry use text mining!

  • She says: “Text mining is your friend; it can help you make the process faster by filtering and recommending possible words…

  • She continues: “Text mining is a subfield of AI and NLP and is related to data science, data mining, and machine learning. It will make the process faster and cut some of the expenses!”

  • After consulting with Larry and Harry, they decide to give text mining a try!

Example

Text mining definition?

  • Which can be a part of Text Mining definition?
    • The discovery by computer of new, previously unknown information from textual data
    • Automatically extracting information from text
    • Text mining is about looking for patterns in text
    • Text mining describes a set of techniques that model and structure the information content of textual sources


(You may choose multiple options)

Go to www.menti.com and use the code 22 07 62 0

Text mining definition

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

Logistics

Access

Program

Time          | Monday       | Tuesday      | Wednesday    | Thursday
9:00 – 10:30  | Lecture 1    | Lecture 3    | Lecture 5    | Lecture 7
              | Break        | Break        | Break        | Break
10:45 – 11:45 | Practical 1  | Practical 3  | Practical 5  | Practical 7
11:45 – 12:30 | Discussion 1 | Discussion 3 | Discussion 5 | Discussion 7
12:30 – 14:00 | Lunch        | Lunch        | Lunch        | Lunch
14:00 – 15:30 | Lecture 2    | Lecture 4    | Lecture 6    | Lecture 8
              | Break        | Break        | Break        | Break
15:45 – 16:30 | Practical 2  | Practical 4  | Practical 6  | Practical 8
16:30 – 17:00 | Discussion 2 | Discussion 4 | Discussion 6 | Discussion 8

Goal of the course

  • The course teaches basic and advanced text mining techniques using Python, applied to a variety of problems across many domains of science.

Python?

  • From 1 to 5, how familiar are you with Python?

Python IDE?!

  • Which Python IDE do you mostly use?

  • From 1 to 10, how familiar are you with Google Colab?

Python

Google Colab

Google Colab?!

  • From 1 to 5, how familiar are you with Google Colab?

Python: Quick & Easy to Learn!

Code in Colab

Do you have any questions?

  • During the lecture

    • Post your questions in the chat; we will read them during a break.
  • During the computer lab

    • Post your questions in the General channel; we will answer them.
  • After the lecture

    • Feel free to send me an email or message me on Microsoft Teams

Introduction

Ayoub Bagheri, a.bagheri@uu.nl

Utrecht Summer School: Applied Text Mining

Another TM definition

Text mining process

Text mining tasks

  • Text classification
  • Text clustering


We will also cover:

  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Word embedding
  • Deep learning models
  • Responsible text mining
  • Text summarization

Text classification

  • Supervised learning
  • Human experts annotate a set of text data
    • Training set
  • Learn a classification model
Document | Class
Email1   | Not spam
Email2   | Not spam
Email3   | Spam

Text classification?

  • Which problem is least likely to be a text classification task?

    • Author’s gender detection from text

    • Determining the smoking status of patients from clinical letters

    • Grouping news articles into political vs. non-political news

    • Classifying reviews into positive and negative sentiment


Go to www.menti.com and use the code 86 08 86 5

Text clustering

  • Unsupervised learning
  • Finding groups of similar documents
  • No labeled data
Document      | Cluster
News article1 | ?
News article2 | ?
News article3 | ?

Text Clustering?

  • Which problem is least likely to be a text clustering task?

    • Grouping similar news articles

    • Grouping discharge letters into two categories: heart disease vs. cancer

    • Grouping tweets that support Trump into three undefined subgroups

    • Grouping the online books of a library into 10 categories


Go to www.menti.com and use the code 86 08 86 5

Text preprocessing

Vector space model

  • Represent documents by concept vectors

    • Each concept defines one dimension

    • \(k\) concepts define a high-dimensional space

    • Element of vector corresponds to concept weight

      • E.g., \(d=(x_1,…,x_k)\), \(x_i\) is “importance” of concept \(i\) in \(d\)
  • Distance between the vectors in this concept space

    • Relationship among documents

An illustration of VS model

  • All documents are projected into this concept space

Bag-of-Words representation

  • Term as the basis for vector space

Tokenization

  • Break a stream of text into meaningful units

    • Tokens: words, phrases, symbols

      • Input: It’s not straight-forward to perform so-called “tokenization.”

      • Output(1): ‘It’s’, ‘not’, ‘straight-forward’, ‘to’, ‘perform’, ‘so-called’, ‘“tokenization.”’

      • Output(2): ‘It’, ‘’s’, ‘not’, ‘straight’, ‘-’, ‘forward’, ‘to’, ‘perform’, ‘so’, ‘-’, ‘called’, ‘“’, ‘tokenization’, ‘.’, ‘”’

    • Definition depends on language, corpus, or even context

Tokenization

  • Solutions

    • Regular expressions

      • e.g., [\w]+: so-called -> ‘so’, ‘called’

      • e.g., [\w’]+: It’s -> ‘It’s’ instead of ‘It’, ‘’s’
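A sketch of the regex approach in Python; the exact patterns below are our assumption, since which one is "right" depends on language, corpus, and context:

```python
import re

# Two candidate tokenization patterns.
split_pattern = re.compile(r"[A-Za-z0-9]+")     # breaks on every non-alphanumeric
keep_apostrophe = re.compile(r"[A-Za-z0-9']+")  # keeps contractions together

print(split_pattern.findall("so-called"))  # ['so', 'called']
print(keep_apostrophe.findall("It's"))     # ["It's"]
```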

Tokenization

Bag-of-Words with N-grams

  • N-grams: a contiguous sequence of N tokens from a given piece of text
    • E.g., ‘Text mining is to identify useful information.’
    • Bigrams: ‘text_mining’, ‘mining_is’, ‘is_to’, ‘to_identify’, ‘identify_useful’, ‘useful_information’, ‘information_.’
  • Pros: capture local dependency and order
  • Cons: a purely statistical view; increases the vocabulary size to \(O(V^N)\)
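A minimal sketch of N-gram construction over a pre-tokenized text; the function name and joining tokens with ‘_’ are our choices:

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-grams of a token list, joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["text", "mining", "is", "to", "identify", "useful", "information", "."]
print(ngrams(tokens, 2))
# ['text_mining', 'mining_is', 'is_to', 'to_identify',
#  'identify_useful', 'useful_information', 'information_.']
```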

Automatic document representation

  • Represent a document with all the occurring words

    • Pros

      • Preserve all information in the text (hopefully)

      • Fully automatic

    • Cons

      • Vocabulary gap: cars vs. car, talk vs. talking

      • Large storage: N-grams need \(O(V^N)\) space

    • Solution

      • Construct controlled vocabulary

A statistical property of language

  • Zipf’s law

    • Frequency of any word is inversely proportional to its rank in the frequency table

    • Formally

      • \(f(k;s,N)=\frac{1/k^s}{\sum_{n=1}^N{1/n^s}}\)

        where \(k\) is the rank of the word, \(N\) is the vocabulary size, and \(s\) is a language-specific parameter

    • Simply: \(f(k;s,N) \propto 1/k^s\)
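The formula can be computed directly. A sketch (the vocabulary size and \(s\) below are arbitrary choices of ours): with \(s=1\), the second-ranked word is predicted to be half as frequent as the first.

```python
def zipf(k, s, N):
    """Relative frequency of the rank-k word: (1/k^s) / sum_{n=1}^N 1/n^s."""
    denom = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / denom

# With s = 1, rank 1 is predicted to be twice as frequent as rank 2.
print(round(zipf(1, 1.0, 10_000) / zipf(2, 1.0, 10_000), 6))  # 2.0
```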

A statistical property of language

Discrete version of power law

Menti

  • In a large Spanish text corpus, if we know that the most frequent word’s frequency is 145,872, what is your best estimate of the second most frequent word’s frequency?

Tokenization/Segmentation

  • Split text into words and sentences

    • Task: what is the most likely segmentation/tokenization?

In Python
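A naive segmentation sketch in pure Python; real tokenizers (e.g. NLTK’s) handle abbreviations and many other edge cases, and the patterns here are our simplification:

```python
import re

def sentences(text):
    """Split on whitespace that follows ., !, or ? -- naive on purpose."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Words as runs of word characters; punctuation as single tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Text mining is fun. It is also useful!"
print(sentences(text))         # ['Text mining is fun.', 'It is also useful!']
print(words("It is useful!"))  # ['It', 'is', 'useful', '!']
```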

Named entity recognition

  • Determine text mapping to proper names

    • Task: what is the most likely mapping?

In Python
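Real NER systems use trained statistical models (e.g. spaCy). As a self-contained stand-in, here is a toy heuristic of our own: propose capitalized, non-sentence-initial tokens as candidate proper names. It illustrates the mapping task, not a real solution:

```python
def candidate_names(tokens):
    """Toy heuristic: capitalized tokens that do not start the sentence."""
    return [t for i, t in enumerate(tokens) if i > 0 and t[:1].isupper()]

tokens = ["Yesterday", "Hema", "merged", "with", "Intertoys", "."]
print(candidate_names(tokens))  # ['Hema', 'Intertoys']
```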

Part Of Speech (POS) Tagging

  • Annotate each word in a sentence with a part-of-speech.

  • Useful for subsequent syntactic parsing and word sense disambiguation.

In Python

Zipf’s law tells us

  • Head words account for a large portion of occurrences, but they are semantically meaningless

    • E.g., the, a, an, we, do, to
  • Tail words account for a major portion of the vocabulary, but they rarely occur in documents

    • E.g., sesquipedalianism
  • The words in between are the most representative

    • To be included in the controlled vocabulary

Automatic document representation


Normalization

  • Convert different forms of a word to a normalized form in the vocabulary

    • U.S.A. -> USA, St. Louis -> Saint Louis
  • Solution

    • Rule-based

      • Delete periods and hyphens

      • All in lower cases

    • Dictionary-based

      • Construct equivalence classes ← We will come back to this later

        • Car -> “automobile, vehicle”

        • Mobile phone -> “cellphone”

In Python
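A sketch of the rule-based side (lowercasing, deleting periods and hyphens) plus a tiny dictionary-based equivalence class; the dictionary entries are illustrative only:

```python
def normalize(token):
    """Rule-based: lowercase and delete periods and hyphens."""
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A."))            # 'usa'
print(normalize("straight-forward"))  # 'straightforward'

# Dictionary-based: map tokens onto a canonical equivalence class.
equivalence = {"car": "automobile", "vehicle": "automobile",
               "mobile phone": "cellphone"}
print(equivalence.get("car", "car"))  # 'automobile'
```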

Stemming

  • Reduce inflected or derived words to their root form

    • Plurals, adverbs, inflected word forms

      • E.g., ladies -> lady, referring -> refer, forgotten -> forget
    • Bridge the vocabulary gap

    • Solutions (for English)

      • Porter stemmer: patterns of vowel-consonant sequence

      • Krovetz stemmer: morphological rules

    • Risk: lose precise meaning of the word

      • E.g., lay -> lie (a false statement? or be in a horizontal position?)

In Python
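A toy suffix-stripping stemmer, much simpler than the Porter or Krovetz stemmers named above; the rules and their order are our own illustration, not either algorithm:

```python
def stem(word):
    """Strip a few common suffixes; NOT a real stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "y"                 # ladies -> lady
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]                   # referr -> refer
        return word
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                       # cars -> car
    return word

print(stem("ladies"))     # 'lady'
print(stem("referring"))  # 'refer'
print(stem("cars"))       # 'car'
```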

Stopwords

  • Useless words for document analysis

    • Not all words are informative

    • Remove such words to reduce vocabulary size

    • No universal definition

    • Risk: break the original meaning and structure of the text

      • E.g., this is not a good option -> option

        to be or not to be -> null

In Python
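A stopword-filtering sketch. There is no universal stopword list, so the small set below is an assumption of ours; the examples show exactly the risk from the slide:

```python
STOPWORDS = {"a", "an", "the", "is", "to", "be", "or", "not", "this", "we", "do"}

def remove_stopwords(tokens):
    """Keep only tokens not in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["this", "is", "not", "a", "good", "option"]))
# ['good', 'option'] -- the negation is lost
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# [] -- the whole phrase disappears
```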

Recap: a statistical property of language

Discrete version of power law

Constructing a VSM representation

D1: ‘Text mining is to identify useful information.’
  1. Tokenization:
     D1: ‘Text’, ‘mining’, ‘is’, ‘to’, ‘identify’, ‘useful’, ‘information’, ‘.’

  2. Stemming/normalization:
     D1: ‘text’, ‘mine’, ‘is’, ‘to’, ‘identify’, ‘use’, ‘inform’, ‘.’

  3. N-gram construction:
     D1: ‘text-mine’, ‘mine-is’, ‘is-to’, ‘to-identify’, ‘identify-use’, ‘use-inform’, ‘inform-.’

  4. Stopword/controlled vocabulary filtering:
     D1: ‘text-mine’, ‘to-identify’, ‘identify-use’, ‘use-inform’

How to assign weights?

  • Important!
  • Why?
    • Corpus-wise: some terms carry more information about the document content
    • Document-wise: not all terms are equally important
  • How?
    • Two basic heuristics
      • TF (Term Frequency) = Within-doc-frequency
      • IDF (Inverse Document Frequency)

Binary representation

Term frequency

  • Idea: a term is more important if it occurs more frequently in a document
  • TF Formulas
    • Let \(c(t,d)\) be the frequency count of term \(t\) in doc \(d\)
    • Raw TF: \(tf(t,d) = c(t,d)\)
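Raw TF is just a count, so `collections.Counter` computes it in one line (the toy document below is ours):

```python
from collections import Counter

def raw_tf(tokens):
    """Raw TF: tf(t, d) = c(t, d), the count of term t in document d."""
    return Counter(tokens)

doc = ["text", "mining", "is", "mining", "of", "text", "text"]
tf = raw_tf(doc)
print(tf["text"], tf["mining"], tf["of"])  # 3 2 1
```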

TF normalization

  • Two views of document length
    • A doc is long because it is verbose
    • A doc is long because it has more content
  • Raw TF is inaccurate
    • Document length variation
    • “Repeated occurrences” are less informative than the “first occurrence”
    • Semantic information does not increase proportionally with the number of term occurrences
  • Generally penalize long documents, but avoid over-penalizing
    • Pivoted length normalization

TF normalization

  • Maximum TF scaling
    • \(tf(t,d) = \alpha + (1 - \alpha)\frac{c(t,d)}{\underset{t}{\operatorname{max}} {c(t,d)}}\), if \(c(t,d)>0\)
    • Normalize by the most frequent word in this doc

TF normalization

  • Sub-linear TF scaling
    • \[tf(t,d)= \left\{ \begin{array}{rl} 1 + \log c(t,d), & \text{if } c(t,d)>0 \\ 0, & \text{otherwise} \end{array} \right. \]

In Python
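The sub-linear scaling above in Python (natural log assumed, matching the formula):

```python
import math

def sublinear_tf(count):
    """tf(t, d) = 1 + log c(t, d) if c(t, d) > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

print(sublinear_tf(1))          # 1.0
print(sublinear_tf(0))          # 0.0
print(sublinear_tf(100) < 100)  # True: growth is sub-linear
```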

Document frequency

  • Idea: a term is more discriminative if it occurs only in fewer documents

In Python
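Document frequency counts documents, not occurrences. A sketch over a toy corpus of token sets (the corpus is ours):

```python
def doc_freq(corpus, term):
    """df(t): number of documents containing term t at least once."""
    return sum(1 for doc in corpus if term in doc)

corpus = [{"text", "mining"}, {"text", "data"}, {"data", "science"}]
print(doc_freq(corpus, "text"))     # 2
print(doc_freq(corpus, "science"))  # 1
```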

Inverse document frequency

  • Solution
    • Assign higher weights to rare terms
    • Formula
      • \(IDF(t) = 1 + \log(\frac{N}{df(t)})\)
      \(\log\): non-linear scaling; \(N\): total number of documents; \(df(t)\): number of docs containing term \(t\)
    • A corpus-specific property
      • Independent of a single document
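The IDF formula in Python (natural log assumed here; real implementations differ in log base and smoothing):

```python
import math

def idf(N, df):
    """IDF(t) = 1 + log(N / df(t))."""
    return 1 + math.log(N / df)

print(idf(3, 3))            # 1.0 -- a term in every document gets the minimum
print(round(idf(3, 1), 3))  # 2.099 -- rarer terms get higher weight
```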

Menti

  • If we remove one document from the corpus, how would it affect the IDF of words in the vocabulary?
  • If we add one document to the corpus, how would it affect the IDF of words in the vocabulary?

Why document frequency

  • How about total term frequency?
    • \(ttf(t) = \sum_d{c(t,d)}\)

  • Cannot recognize words frequently occurring in a subset of documents

TF-IDF weighting

  • Combining TF and IDF
    • Common in doc → high tf → high weight
    • Rare in collection → high idf → high weight
    • \(w(t,d)=TF(t,d) \times IDF(t)\)
  • Most well-known document representation schema in IR! (Salton et al., 1983)
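Combining the two pieces, \(w(t,d)=TF(t,d)\times IDF(t)\), with raw TF and \(IDF(t) = 1 + \log(N/df(t))\). This pairing is our choice for illustration; many TF-IDF variants exist:

```python
import math

def tf_idf(count, N, df):
    """w(t, d) = TF(t, d) * IDF(t), with raw TF and IDF(t) = 1 + log(N / df)."""
    return count * (1 + math.log(N / df))

# Term appearing twice, in 1 of 10 documents: high weight.
print(round(tf_idf(2, 10, 1), 3))  # 6.605
# Same count, but the term appears in all 10 documents: low weight.
print(tf_idf(2, 10, 10))           # 2.0
```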

How to define a good similarity metric?

  • Euclidean distance
    • \(dist(d_i, d_j) = \sqrt{\sum_{t\in V}{[tf(t,d_i)idf(t) - tf(t, d_j)idf(t)]^2}}\)
    • Longer documents will be penalized by the extra words
    • We care more about how much these two vectors overlap

From distance to angle

  • Angle: how much the vectors overlap
    • Cosine similarity – projection of one vector onto another

Cosine similarity

  • Angle between two vectors

    • \(cosine(d_i, d_j) = \frac{V_{d_i}^TV_{d_j}}{|V_{d_i}|_2 \times |V_{d_j}|_2}\) ← TF-IDF vector

  • Documents are normalized by length
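Cosine similarity over sparse term-weight vectors stored as dicts (term → TF-IDF weight); the dict representation and example vectors are our choices:

```python
import math

def cosine(u, v):
    """cos(d_i, d_j) = (u . v) / (|u|_2 * |v|_2)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

d1 = {"text": 1.0, "mining": 2.0}
d2 = {"text": 2.0, "mining": 4.0}  # same direction, twice the length
print(round(cosine(d1, d2), 9))              # 1.0 -- length does not matter
print(cosine({"text": 1.0}, {"data": 1.0}))  # 0.0 -- no shared terms
```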

Advantages and disadvantages of VS model

  • Empirically effective!
  • Intuitive
  • Easy to implement
  • Well-studied and extensively evaluated
  • The SMART system
    • Developed at Cornell: 1960-1999
    • Still widely used
  • Warning: many variants of TF-IDF!

Disadvantages of VS model

  • Assume term independence
  • Lack of “predictive adequacy”
    • Arbitrary term weighting
    • Arbitrary similarity measure
  • Lots of parameter tuning!

Regular expressions

Menti

Other examples (from different disciplines)

  • SALTClass
  • ICD Classification
  • ASReview

Summary

Summary: what did we learn?

Time for Practical 1!